Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.3 - Creating New Columns")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
   id breed_id nickname             birthday age  color
0   1        1     King  2014-11-22 12:30:31   5  brown
1   2        3    Argus  2016-11-22 10:05:10  10   None
2   3        1   Chewie  2016-11-22 10:05:10  15   None

Creating New Columns and Transforming Data

When we are data wrangling or transforming data, we will usually assign the result to a new column. We will explore the withColumn() function and other transformation functions to achieve our end results.

We will also look at how to rename a column with withColumnRenamed(); this is useful when, for example, two DataFrames need matching column names before a join.

Case 1: New Columns - withColumn()

(
    pets
    .withColumn('nickname_copy', F.col('nickname'))
    .withColumn('nickname_capitalized', F.upper(F.col('nickname')))
    .toPandas()
)
   id breed_id nickname             birthday age  color nickname_copy nickname_capitalized
0   1        1     King  2014-11-22 12:30:31   5  brown          King                 KING
1   2        3    Argus  2016-11-22 10:05:10  10   None         Argus                ARGUS
2   3        1   Chewie  2016-11-22 10:05:10  15   None        Chewie               CHEWIE

What Happened?

We duplicated the nickname column as nickname_copy using the withColumn() function. We also created a new column where all the letters of the nickname are capitalized, by chaining multiple spark functions together.

We will look into more advanced column creation in the next section. There we will go into more detail about what a column expression is and what the purpose of F.col() is.

Case 2: Renaming Columns - withColumnRenamed()

(
    pets
    .withColumnRenamed('id', 'pet_id')
    .toPandas()
)
   pet_id breed_id nickname             birthday age  color
0       1        1     King  2014-11-22 12:30:31   5  brown
1       2        3    Argus  2016-11-22 10:05:10  10   None
2       3        1   Chewie  2016-11-22 10:05:10  15   None

What Happened?

We renamed and replaced the id column with pet_id.

Summary

  • We learned how to create new columns from old ones by chaining spark functions and using withColumn().
  • We learned how to rename columns using withColumnRenamed().
